Finding top-k elements in data streams

Authors

  • Nuno Homem
  • João Paulo Carvalho
Abstract

Identifying the most frequent elements in a data stream is a well-known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. Fast algorithms with a small memory footprint are paramount when the number of individuals is very large. In many situations such analysis needs to be performed and kept up to date in near real time. Fortunately, approximate answers are usually adequate when dealing with this problem. This paper presents a new algorithm that addresses this problem by merging the commonly used counter-based and sketch-based techniques for top-k identification. The algorithm provides the top-k list of elements, their frequencies, and an error estimate for each frequency value. It also provides strong guarantees on the error estimate, the order of elements, and the inclusion of elements in the list depending on their real frequency. Additionally, the algorithm provides stochastic bounds on the error and expected error estimates. Telecommunications customer behavior and voice call data are used to present concrete results obtained with this algorithm and to illustrate improvements over previously existing algorithms. © 2010 Elsevier Inc. All rights reserved.
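
To make the counter-based half of the idea concrete, here is a minimal Python sketch of a generic Space-Saving style summary. It is not the algorithm presented in the paper, which also involves a sketch-based component that the abstract does not detail. The class name `SpaceSaving`, its `capacity` parameter, and the methods `add` and `top` are hypothetical names chosen for illustration. The sketch returns a top-k list in which each element carries an estimated frequency and a bound on how much that estimate may overcount.

```python
class SpaceSaving:
    """Generic counter-based top-k summary (illustrative, not the paper's method)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # maximum number of monitored elements
        self.counts = {}          # element -> estimated count (never an underestimate)
        self.errors = {}          # element -> maximum possible overestimate

    def add(self, element):
        if element in self.counts:
            # Already monitored: just increment its counter.
            self.counts[element] += 1
        elif len(self.counts) < self.capacity:
            # Free slot available: start an exact counter.
            self.counts[element] = 1
            self.errors[element] = 0
        else:
            # Summary is full: evict the element with the smallest count.
            # The newcomer inherits that count as its possible error.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[element] = floor + 1
            self.errors[element] = floor

    def top(self, k):
        """Return up to k tuples of (element, estimated count, error bound)."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [(elem, count, self.errors[elem]) for elem, count in ranked[:k]]


if __name__ == "__main__":
    summary = SpaceSaving(capacity=3)
    for item in "aaabbcd":
        summary.add(item)
    for elem, count, err in summary.top(2):
        # The true frequency lies between count - err and count.
        print(elem, count, err)
```

For every monitored element the true frequency lies between `count - error` and `count`, which is the kind of per-element error estimate the abstract refers to. The eviction step here scans all counters for the minimum, which keeps the sketch short; a practical implementation would likely use a structure that finds the minimum in constant time.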

Similar resources

Efficient Computation of Frequent and Top-k Elements in Data Streams

We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general d...

Finding top-k elements in a time-sliding window

Identifying the top-k most frequent elements is one of the many problems associated with data stream analysis. It is a well-known and difficult problem, especially if the analysis is to be performed and kept up to date in near real time. Analyzing data streams in a time-sliding window model is of particular interest, as only the most recent, more relevant events are considered. Approximate ...

Mining top-k high utility patterns over data streams

Online high utility itemset mining over data streams has been studied recently. However, the existing methods are not designed for producing top-k patterns. Since there could be a large number of high utility patterns, finding only top-k patterns is more attractive than producing all the patterns whose utility is above a threshold. A challenge with finding top-k high utility itemsets over data s...

Matching Top-k Answers of Twig Patterns in Probabilistic XML

The flexibility of the XML data model allows a more natural representation of uncertain data compared with the relational model. The top-k matching of a twig pattern against probabilistic XML data is essential. Some classical twig pattern algorithms can be adjusted to process probabilistic XML. However, as far as finding answers of the top-k probabilities is concerned, the existing algorithms s...

How to select the largest k elements from evolving data?

In this paper we investigate the top-k-selection problem, i.e., determining the largest, second largest, ..., and k-th largest elements, in the dynamic data model. In this model the order of elements evolves dynamically over time. In each time step the algorithm can only probe the changes of data by comparing a pair of elements. Previously only two special cases were studied [2]: finding the l...

Journal:
  • Inf. Sci.

Volume 180

Publication date: 2010